{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Data manipulation\n",
                "This notebook provides necessary steps to generate DKN's input dataset from the MAG COVID-19 raw dataset "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": [
                "import os \n",
                "import codecs\n",
                "import pickle\n",
                "import time \n",
                "from datetime import datetime  \n",
                "import random\n",
                "import numpy as np\n",
                "import math\n",
                "\n",
                "from utils.task_helper import *\n",
                "from utils.general import *\n",
                "from utils.data_helper import *\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Preparing paper related files\n",
                "First let's generate data for papers. \n",
                "For DKN, the paper data format is like: <br>\n",
                "`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]` <br>\n",
                "where w and e are the indices of words and entities sequence of this paper. \n",
                "Words and entities are aligned. To take a quick example, a paper with title is:  <br> `One Health approach in the South East Asia region: opportunities and challenges` <br> \n",
                "Then the title words value can be <br>  `101,56,23,14,1,69,256,887,365,32,11,567` <br>   and the title entitie value can be: <br>  `10,10,0,0,0,45,45,45,0,0,0,0` <br>  The first two values of entities sequence is 10, indicating that these two words corresponding to the same entity. The title value and entity value is hashed from 1 to n and m(n/m is the number of distinct words/entities). "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [],
            "source": [
                "InFile_dir = 'data_folder/raw'\n",
                "OutFile_dir = 'data_folder/my'\n",
                "create_dir(OutFile_dir)\n",
                "\n",
                "Path_PaperTitleAbs_bySentence = os.path.join(InFile_dir, 'PaperTitleAbs_bySentence.txt')\n",
                "Path_PaperFeature = os.path.join(OutFile_dir, 'paper_feature.txt')\n",
                "\n",
                "max_word_size_per_paper = 15 "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Step 1 is to hash the words and entities. <br>\n",
                "For simplicy, in this tutorial we only use the paper title to repsesent the content of paper. Definitely you can use more content, such as paper abstract and paper body. <br>\n",
                "Each feature length should be fixed at k (max_word_size_per_paper), if the number of words in document is more than k, we will truncate the document to k words. If the number of words in document is less than k, we will pad 0 to the end. "
            ]
        },
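        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The truncate-or-pad step can be sketched as follows (a minimal illustration only, not the actual `gen_paper_content` implementation):\n",
                "\n",
                "```python\n",
                "def fix_length(indices, k, pad=0):\n",
                "    # Truncate a sequence of indices to k items, or pad with `pad` up to length k.\n",
                "    return indices[:k] + [pad] * max(0, k - len(indices))\n",
                "\n",
                "fix_length([101, 56, 23], 5)             # -> [101, 56, 23, 0, 0]\n",
                "fix_length([101, 56, 23, 14, 1, 69], 5)  # -> [101, 56, 23, 14, 1]\n",
                "```"
            ]
        },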
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "loading file PaperTitleAbs_bySentence.txt...\n",
                        "loading line: 880000, time elapses: 10.1s  \n",
                        "parsing into feature file  ...\n",
                        "parsed paper count: 110000, time elapses: 0.5s \n"
                    ]
                }
            ],
            "source": [
                "word2idx = {}\n",
                "entity2idx = {}\n",
                "relation2idx = {}\n",
                "word2idx, entity2idx = gen_paper_content(\n",
                "    Path_PaperTitleAbs_bySentence, Path_PaperFeature, word2idx, entity2idx, field=[\"Title\"], doc_len=max_word_size_per_paper\n",
                ")\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Step 2 is to generate the data of the knowledge graph, in turns of a set of triples: <br>\n",
                "`head, tail, relation` <br>"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "processing file RelatedFieldOfStudy.txt... done.\n"
                    ]
                }
            ],
            "source": [
                "word2idx_filename = os.path.join(OutFile_dir, 'word2idx.pkl')\n",
                "entity2idx_filename = os.path.join(OutFile_dir, 'entity2idx.pkl')\n",
                "\n",
                "Path_RelatedFieldOfStudy = os.path.join(InFile_dir, 'RelatedFieldOfStudy.txt')\n",
                "OutFile_dir_KG = os.path.join(OutFile_dir, 'KG')\n",
                "create_dir(OutFile_dir_KG)\n",
                "\n",
                "gen_knowledge_relations(Path_RelatedFieldOfStudy, OutFile_dir_KG, entity2idx, relation2idx) "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The data files will be outputed to the folder `OutFile_dir_KG`.  <br>\n",
                "To train word embeddings, we need a collection of sentences:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "loading file PaperTitleAbs_bySentence.txt...\n",
                        "loading line: 880000, time elapses: 8.8s "
                    ]
                }
            ],
            "source": [
                "Path_SentenceCollection = os.path.join(OutFile_dir, 'sentence.txt')\n",
                "gen_sentence_collection(\n",
                "    Path_PaperTitleAbs_bySentence,\n",
                "    Path_SentenceCollection,\n",
                "    word2idx\n",
                ")\n",
                "\n",
                "## save the id mapper\n",
                "with open(word2idx_filename, 'wb') as f:\n",
                "    pickle.dump(word2idx, f)\n",
                "dump_dict_as_txt(word2idx, os.path.join(OutFile_dir, 'word2id.tsv'))\n",
                "with open(entity2idx_filename, 'wb') as f:\n",
                "    pickle.dump(entity2idx, f)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare user related files\n",
                "Next we generate user related files.\n",
                "Our first task is user-to-paper recommendations. For each user, we collect his/her complete cited papers, and arrange them in chronological order. The recommendation task can then be formulated as: given a user's citation history, to predict what paper he/she will cite in the future."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "loading PaperAuthorAffiliations.txt...\n",
                        "loading Papers.txt...\n",
                        "loading PaperReferences.txt...\n",
                        "parsing user's reference list ...\n",
                        "parsed user count: 430000, time elapses: 3.6s \n",
                        "outputing author reference list\n"
                    ]
                }
            ],
            "source": [
                "\n",
                "_t0 = time.time()\n",
                "\n",
                "Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')\n",
                "Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')\n",
                "Path_Papers = os.path.join(InFile_dir, 'Papers.txt')\n",
                "Path_Author2ReferencePapers = os.path.join(OutFile_dir, 'Author2ReferencePapers.tsv')\n",
                "\n",
                "author2paper_list = load_author_paperlist(Path_PaperAuthorAffiliations)\n",
                "paper2date = load_paper_date(Path_Papers)\n",
                "paper2reference_list = load_paper_reference(Path_PaperReference)\n",
                "\n",
                "author2reference_list = get_author_reference_list(author2paper_list, paper2reference_list, paper2date)\n",
                "\n",
                "output_author2reference_list(\n",
                "    author2reference_list,\n",
                "    Path_Author2ReferencePapers\n",
                ")\n",
                "\n",
                "OutFile_dir_DKN = os.path.join(OutFile_dir, 'DKN-training-folder')\n",
                "create_dir(OutFile_dir_KG)\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "#### DKN takes several more files as inputs:\n",
                "- training / validation / test files: each line in these files represents one instance. Impressionid is used to evaluate performance within an impression session, so it is only used when evaluating, you can set it to 0 for training data. The format is : <br> \n",
                "`[label] [userid] [CandidateNews]%[impressionid] `<br> \n",
                "e.g., `1 train_U1 N1%0` <br> \n",
                "- user history file: each line in this file represents a users' citation history. You need to set his_size parameter in config file, which is the max number of user's click history we use. We will automatically keep the last his_size number of user click history, if user's click history is more than his_size, and we will automatically padding 0 if user's click history less than his_size. the format is : <br> \n",
                "`[Userid] [newsid1,newsid2...]`<br>\n",
                "e.g., `train_U1 N1,N2` <br> \n",
                "\n",
                "DKN take recommendations as a binary classification problem. We sample negative instances according to item's popularity:\n",
                "<img src=\"https://recodatasets.z20.web.core.windows.net/kdd2020/images/item-popularity.JPG\" width=\"600\">"
            ]
        },
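        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Popularity-weighted negative sampling can be sketched as follows (a minimal illustration under assumed inputs, not the actual sampler used by `gen_experiment_splits`):\n",
                "\n",
                "```python\n",
                "import random\n",
                "\n",
                "def sample_negatives(items, popularity, positives, n):\n",
                "    # Draw n negative items, weighted by popularity, skipping known positives.\n",
                "    negatives = []\n",
                "    while len(negatives) < n:\n",
                "        item = random.choices(items, weights=popularity, k=1)[0]\n",
                "        if item not in positives:\n",
                "            negatives.append(item)\n",
                "    return negatives\n",
                "\n",
                "sample_negatives(['N1', 'N2', 'N3', 'N4'], [10, 5, 2, 1], {'N1'}, 3)\n",
                "```"
            ]
        },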
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "expanding user behaviors...\n",
                        "processing user number : 287000, time elapses: 1.7s done. \n",
                        "sample number in train / valid / test is 150874 / 8198 / 8198\n",
                        "negative sampling for train...\n",
                        "sampling process 0:  150000 / 150874, time elapses: 28.3s                                                                                                                                                                                                                                                                                                         \tsampling process 1 done.\n",
                        "\tsampling process 0 done.\n",
                        "negative sampling for validation...\n",
                        "sampling process 1:  8000 / 8198, time elapses: 1.5s                \tsampling process 0 done.\n",
                        "\tsampling process 1 done.\n",
                        "negative sampling for test...\n",
                        "sampling process 1:  8000 / 8198, time elapses: 1.6s                \tsampling process 0 done.\n",
                        "\tsampling process 1 done.\n",
                        "done.\n",
                        "time elapses for user is : 51.8s\n"
                    ]
                }
            ],
            "source": [
                "gen_experiment_splits(\n",
                "    Path_Author2ReferencePapers,\n",
                "    OutFile_dir_DKN,\n",
                "    Path_PaperFeature,\n",
                "    item_ratio=0.1,\n",
                "    tag='small',\n",
                "    process_num=2\n",
                ")\n",
                "\n",
                "_t1 = time.time()\n",
                "print('time elapses for user is : {0:.1f}s'.format(_t1 - _t0))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare item2item recommendation dataset\n",
                "Our second recommendation scenario is about item-to-item recommendations. Given a paper, we can recommend a list of related papers for users to cite.\n",
                "Here we use a supervised learning approach to train this model. Each instance is a tuple of <paper_a, paper_b, label>. Label = 1 means the pair is highly related; otherwise the label will be 0.\n",
                "The positive labels are constructed in the following three ways: <br>\n",
                "1. Paper A and B overlap a lot in their reference list; \n",
                "2. Paper A and B are co-cited by many other papers;\n",
                "3. Paper A and B are published in 12 months by the same author (first author)."
            ]
        },
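        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The co-citation signal (way 2) can be sketched as a pair count over reference lists (a minimal illustration, not the actual `gen_paper_cocitation` implementation):\n",
                "\n",
                "```python\n",
                "from collections import Counter\n",
                "from itertools import combinations\n",
                "\n",
                "def count_cocitations(paper2references):\n",
                "    # Count how often each pair of papers appears together in a reference list.\n",
                "    pair_counts = Counter()\n",
                "    for refs in paper2references.values():\n",
                "        for a, b in combinations(sorted(set(refs)), 2):\n",
                "            pair_counts[(a, b)] += 1\n",
                "    return pair_counts\n",
                "\n",
                "count_cocitations({'P1': ['A', 'B', 'C'], 'P2': ['A', 'B']})[('A', 'B')]  # -> 2\n",
                "```"
            ]
        },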
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "loading PaperReferences.txt...\n",
                        "process paper num 53400 / 53452...time elapses: 8.8s\tDone.\n",
                        "process paper num 73600 / 73699...time elapses: 48.9s\tDone.\n",
                        "loading Papers.txt...\n",
                        "loading PaperAuthorAffiliations.txt...\n",
                        "process author num 435800 / 435822...time elapses: 1.0s"
                    ]
                }
            ],
            "source": [
                "OutFile_dir_item2item = r'data_folder/my/item2item'\n",
                "create_dir(OutFile_dir_item2item)\n",
                "Path_PaperFeature\n",
                "item_set = load_has_feature_items(Path_PaperFeature)\n",
                "\n",
                "\n",
                "Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')\n",
                "pair2CocitedCnt, pair2CoReferenceCnt = gen_paper_cocitation(Path_PaperReference)\n",
                "\n",
                "Path_paper_pair_cocitation = os.path.join(OutFile_dir_item2item, 'paper_pair_cocitation_cnt.csv')\n",
                "Path_paper_pair_coreference = os.path.join(OutFile_dir_item2item, 'paper_pair_coreference_cnt.csv')\n",
                "\n",
                "with open(Path_paper_pair_cocitation, 'w') as wt:\n",
                "    for p, v in pair2CocitedCnt.items():\n",
                "        if p[0] in item_set and p[1] in item_set:\n",
                "            wt.write('{0},{1},{2}\\n'.format(p[0], p[1], v))\n",
                "\n",
                "with open(Path_paper_pair_coreference, 'w') as wt:\n",
                "    for p, v in pair2CoReferenceCnt.items():\n",
                "        if p[0] in item_set and p[1] in item_set:\n",
                "            wt.write('{0},{1},{2}\\n'.format(p[0], p[1], v))\n",
                "            \n",
                "            \n",
                "Path_Papers = os.path.join(InFile_dir, 'Papers.txt')\n",
                "Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')\n",
                "paper2date = load_paper_date(Path_Papers)\n",
                "author2paper_list, paper2author_set = load_paper_author_relation(Path_PaperAuthorAffiliations)\n",
                "Path_FirstAuthorPaperPair = os.path.join(OutFile_dir_item2item, 'paper_pair_cofirstauthor.csv')\n",
                "first_author_pairs = gen_paper_pairs_from_same_author(\n",
                "    author2paper_list, paper2author_set, paper2date, Path_FirstAuthorPaperPair, item_set\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Now let's separate the instances into training and validation set, and conduct negative sampling:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "negative sampling for file item2item_train.txt...\n",
                        "process line num 182500 / 182537...time elapses: 3.6s\tdone.\n",
                        "negative sampling for file item2item_valid.txt...\n",
                        "process line num 45600 / 45613...time elapses: 0.9s\tdone.\n"
                    ]
                }
            ],
            "source": [
                "split_train_valid_file(\n",
                "    [Path_paper_pair_cocitation, Path_FirstAuthorPaperPair, Path_paper_pair_coreference],\n",
                "    OutFile_dir_DKN\n",
                ")\n",
                "gen_negative_instances(\n",
                "    item_set,\n",
                "    os.path.join(OutFile_dir_DKN, 'item2item_train.txt'),\n",
                "    os.path.join(OutFile_dir_DKN, 'item2item_train_instances.txt'),\n",
                "    9\n",
                ")\n",
                "gen_negative_instances(\n",
                "    item_set,\n",
                "    os.path.join(OutFile_dir_DKN, 'item2item_valid.txt'),\n",
                "    os.path.join(OutFile_dir_DKN, 'item2item_valid_instances.txt'),\n",
                "    9\n",
                ")\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Generating the full dataset will take a longer time, let it run in the background freely..."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "expanding user behaviors...\n",
                        "processing user number : 287000, time elapses: 8.7s done. \n",
                        "sample number in train / valid / test is 1782333 / 125010 / 125010\n",
                        "negative sampling for train...\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "sampling process 1:  1014000 / 1782333, time elapses: 698.0s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                          sampling process 2:  1774000 / 1782333, time elapses: 1207.6s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
\tsampling process 3 done.\n",
                        "sampling process 6:  1777000 / 1782333, time elapses: 1210.3s \tsampling process 0 done.\n",
                        "sampling process 6:  1780000 / 1782333, time elapses: 1211.8s \tsampling process 2 done.\n",
                        "sampling process 7:  1778000 / 1782333, time elapses: 1212.9s \tsampling process 6 done.\n",
                        "sampling process 5:  1781000 / 1782333, time elapses: 1215.0s \tsampling process 1 done.\n",
                        "\tsampling process 7 done.\n",
                        "sampling process 5:  1782000 / 1782333, time elapses: 1215.5s \tsampling process 5 done.\n",
                        "sampling process 4:  1782000 / 1782333, time elapses: 1220.2s \tsampling process 4 done.\n",
                        "negative sampling for validation...\n",
                        "sampling process 4:  125000 / 125010, time elapses: 80.2s \tsampling process 4 done.\n",
                        "sampling process 0:  125000 / 125010, time elapses: 80.5s \tsampling process 7 done.\n",
                        "sampling process 3:  123000 / 125010, time elapses: 80.4s \tsampling process 0 done.\n",
                        "sampling process 3:  125000 / 125010, time elapses: 81.3s \tsampling process 3 done.\n",
                        "sampling process 1:  125000 / 125010, time elapses: 82.3s \tsampling process 1 done.\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "sampling process 5:  125000 / 125010, time elapses: 82.3s \tsampling process 5 done.\n",
                        "sampling process 6:  125000 / 125010, time elapses: 83.7s \tsampling process 6 done.\n",
                        "sampling process 2:  125000 / 125010, time elapses: 84.2s \tsampling process 2 done.\n",
                        "negative sampling for test...\n",
                        "sampling process 1:  125000 / 125010, time elapses: 81.9s \tsampling process 1 done.\n",
                        "sampling process 6:  125000 / 125010, time elapses: 83.0s \tsampling process 6 done.\n",
                        "sampling process 5:  125000 / 125010, time elapses: 83.3s \tsampling process 5 done.\n",
                        "sampling process 3:  125000 / 125010, time elapses: 83.5s \tsampling process 3 done.\n",
                        "sampling process 7:  125000 / 125010, time elapses: 83.4s \tsampling process 7 done.\n",
                        "sampling process 2:  125000 / 125010, time elapses: 83.8s \tsampling process 2 done.\n",
                        "sampling process 4:  125000 / 125010, time elapses: 83.9s \tsampling process 4 done.\n",
                        "sampling process 0:  125000 / 125010, time elapses: 85.2s \tsampling process 0 done.\n",
                        "done.\n"
                    ]
                }
            ],
            "source": [
                "gen_experiment_splits(\n",
                "    Path_Author2ReferencePapers,\n",
                "    OutFile_dir_DKN,\n",
                "    Path_PaperFeature,\n",
                "    item_ratio=1.0,\n",
                "    tag='full',\n",
                "    process_num=8\n",
                ") "
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python (kdd tutorial)",
            "language": "python",
            "name": "kdd_tutorial_2020"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.10"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}